RNA splicing analysis using heterogeneous and large RNA-seq datasets

The ubiquity of RNA-seq has led to many methods that use RNA-seq data to analyze variations in RNA splicing. However, available methods are not well suited for handling heterogeneous and large datasets. Such datasets scale to thousands of samples across dozens of experimental conditions, exhibit increased variability compared to biological replicates, and involve thousands of unannotated splice variants resulting in increased transcriptome complexity. We describe here a suite of algorithms and tools implemented in the MAJIQ v2 package to address challenges in detection, quantification, and visualization of splicing variations from such datasets. Using both large scale synthetic data and GTEx v8 as benchmark datasets, we assess the advantages of MAJIQ v2 compared to existing methods. We then apply MAJIQ v2 package to analyze differential splicing across 2,335 samples from 13 brain subregions, demonstrating its ability to offer insights into brain subregion-specific splicing regulation.

† Equal contribution as co-first authors. * To whom correspondence should be addressed; e-mail: yosephb@upenn.edu.

Supplementary Methods
Supplementary Note: procedure for bootstrapped readrates from perposition reads MAJIQ's bootstrapping procedure can be defined as follows. Without loss of generality, consider a single junction. For each RNA-seq read aligned with a split for this junction, we define the read's position relative to the junction (or vice-versa) with a position i and count the number of reads associated with each position, which we call S i .
These raw readrates include biases that we would like to correct for; in particular, we define Other methods typically sum directly over positions R i (really S i since they generally also ignore read stacks) to produce a total junction readrate for use in quantification: (observed total junction readrate) Since we are unsatisfied with uncertainty/variance accounted for by directly using R, we generate samples from a bootstrap distribution over the P nonzero positions.
If we make the assumption that we are given the number of nonzero positions P and that the underlying readrate for each of these positions is independent and identically distributed with finite mean E [R i ] = µ and variance V [R i ] = σ 2 , we can derive the mean and variance of our observed total readrate: =µP, (observed total readrate mean) (observed total readrate variance) If we were able to take two samples for the observed total readrate (i.e. R and R ′ ), their difference has mean 0 and variance 2P σ 2 .
We define our bootstrapping procedure over observed nonzero reads R 1 , . . . , R P to generate bootstrapped total readsR,R ′ , . . . such that the variance of the difference between bootstrap samples would be equivalent to that of the difference between two samples from the true distribution (i.e. 2P σ 2 ). In order to do this, we take P − 1 samples from {R 1 , . . . , R P } with replacement and scale their sum by P/(P − 1).
It is straightforward to see that the bootstrapped total readrate has the same mean as the observed total readrate. In order to prove that the variance of the difference between two sample matches, we note that the covariance Cov R Z k , R Z k ′ between any two draws from the observed per-position readrates with Z i ∼ Uniform(P ) is: . We note that E [R i R j ] = σ 2 δ ij + µ 2 (where δ ij is the Kroencker delta). When k = k ′ , it follows Otherwise, the law of total expectation gives: Combining the two cases, we have: (second moment sampled readrate) Therefore, (covariance sampled readrate) Thus, the variance of the bootstrapped total readrate is (true bootstrap readrate variance) But we want the variance of the difference between two samples from the bootstrap procedure.
So, we calculate the covariance between two distinct samples: (covariance between samples ofP) Therefore, we find that: (bootstrap total readrate variance as difference) In practice, the observed nonzero positions can lead to a bootstrap distribution with variance less than its mean (underdispersed). We generally expect readrates to follow a Poisson or negative binomial (overdispersed) distribution, so in these cases, we fall back to parametric bootstrapping with a Poisson distribution with mean µP . Otherwise, we use the nonparametric bootstrap sampling procedure as described above.     Supplementary Fig. 6: Reproducibility ratio plots ranking by p-value or dPSI. Most methods provides both a p-value and dPSI, and they are used to filter the events, but how to rank the resulting set of events is not defined by the authors. We assessed the effect of ranking first by each tool's p-value or first by dPSI on the resulting RR plots for different group sizes (3,5,15,50). In all cases we find ranking by dPSI, as used in main Fig 2d, to gives better results.  PTTARY  BRNCHB  THYROID  SPLEEN  OVARY  VAGINA  SKINS  UTERUS  LUNG  SNTTRM  BREAST  SKINNS  CRVXECT  CRVXEND  SLVRYG  PNCREAS  FLLPNT  LIVER  ADPSBQ  TESTIS  CLNTRN  CLNSGM  ARTCRN  STMACH  BRNCTXA  KDNCTX  ESPMCS  BLADDER  ARTAORT  ESPMSL  ADPVSC  ADRNLG  ESPGEJ  BRNCDT  LCL  BRNSPC  WHLBLD  HRTAA  BRNHPT  BRNSNG  BRNNCC  BRNACC  HRTLV  ARTTBL  BRNHPP  BRNAMY  BRNCTXB  BRNPTM  FIBRBLS  MSCLSK   NERVET  PRSTTE  BRNCHA  PTTARY  THYROID  VAGINA  OVARY  SPLEEN  SKINS  LUNG  BRNCHB  UTERUS  BREAST  SKINNS  SLVRYG  SNTTRM  CRVXECT  CRVXEND  LIVER  ADPSBQ  PNCREAS  FLLPNT  TESTIS  CLNTRN  ARTCRN  STMACH  CLNSGM  ESPMCS  BLADDER  ADPVSC  ARTAORT  KDNCTX  ADRNLG  LCL  ESPMSL  HRTAA  ESPGEJ  WHLBLD  BRNCTXA  BRNCDT  BRNHPT  BRNSPC  HRTLV  ARTTBL  BRNNCC  BRNSNG  BRNAMY  BRNHPP  BRNACC  BRNPTM  FIBRBLS  BRNCTXB (4)). (f) Same as in (c), but shown for QKI hexamers (ACUAAY).